EχATOLP – a tool for domain relevant terms extraction

Authors

  • Lucelene Lopes
  • Paulo Fernandes
  • Renata Vieira
  • Guilherme Fedrizzi
  • Daniel Martins
Abstract

This paper presents a software tool to extract relevant terms from Portuguese texts. EχATOLP extracts the most frequent noun phrases from an annotated corpus, where the annotation is provided by the PALAVRAS parser. The tool offers several options to improve extraction quality, ranging from post-treatment of the parser annotation to the application of linguistic and statistical criteria. EχATOLP also provides additional features to compare extracted terms with reference lists, to compute numerical efficiency indexes, and to search for terms in the corpus.

Term extraction from corpora is usually the basis of many Natural Language Processing (NLP) tasks, such as automatic glossary construction [7], text categorization [4], and even ontology learning [3]. Term extraction, like many other NLP applications, can benefit from both linguistic and statistical approaches, as their combination often yields better results than either approach alone. The EχATOLP software tool [6] therefore uses both approaches to extract and select domain-significant terms from an annotated domain corpus. From a linguistic point of view, the extraction is based on the syntactic annotation produced by the PALAVRAS parser [2]: the candidate terms are those annotated as noun phrases by the parser, filtered according to an extra set of discard and transformation rules. From a statistical point of view, these candidate terms are then subjected to frequency analysis in order to select the most frequent ones.

Figure 1(a) graphically presents the software architecture. The basic input is a set of .xml files containing the annotated texts of the corpus. The extraction process applies a set of discard and transformation rules to, respectively, discard unwanted noun phrases, e.g., noun phrases containing numerals, or to adapt noun phrases to the purpose of extraction, e.g., by removing articles. Through the tool's options the user can choose which rules from these two sets are applied; the upper screenshot in Figure 1(b) presents the interface where these options are set. Once the candidate terms are extracted, their frequencies in the corpus and in each text are computed. Then, by user choice, candidate terms can be selected according to different criteria, e.g., keeping only the 10% most frequent terms in the whole corpus. The lists of extracted terms can be compared with reference lists.
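The pipeline described above — discard rules, transformation rules, then a frequency cutoff — can be sketched as follows. This is a minimal illustration, not EχATOLP's implementation: the example phrases and the two rules (drop phrases containing numerals, strip leading articles) stand in for the tool's configurable rule sets, and the parser output format is not reproduced.

```python
import re
from collections import Counter

# Hypothetical noun phrases, standing in for PALAVRAS parser output
# (illustrative only; not actual parser annotations).
noun_phrases = [
    "a extração de termos", "os termos relevantes", "extração de termos",
    "o corpus anotado", "3 documentos", "extração de termos",
    "corpus anotado", "os termos relevantes",
]

def discard(phrase):
    """Discard rule: drop noun phrases containing numerals."""
    return bool(re.search(r"\d", phrase))

def transform(phrase):
    """Transformation rule: strip a leading Portuguese article."""
    return re.sub(r"^(o|a|os|as|um|uma)\s+", "", phrase)

# Apply linguistic rules, then count frequencies over the corpus.
candidates = [transform(p) for p in noun_phrases if not discard(p)]
freq = Counter(candidates)

# Statistical selection: keep only the 10% most frequent candidate terms
# (at least one term is always kept).
cutoff = max(1, len(freq) // 10)
selected = [term for term, _ in freq.most_common(cutoff)]
```

With these toy inputs, the numeral-bearing phrase is discarded, articles are stripped so surface variants merge, and only the single most frequent candidate survives the 10% cutoff.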


Related articles

EχATOLP – An Automatic Tool for Term Extraction from Portuguese Language Corpora

This paper describes EχATOLP, a software tool to extract significant terms from an annotated corpus written in Portuguese about a specific domain of interest. Based on both linguistic and statistical approaches, the tool extracts terms that are frequent and syntactically relevant to the domain of interest.


A term extraction tool for expanding content in the domain of functioning, disability, and health: proof of concept

Among the challenges in developing terminology systems is providing complete content coverage of specialized subject fields. This paper reports on a term extraction tool designed for the development and expansion of terminology systems concerned with functioning, disability, and health. Content relevant to this domain is emphasized in the foci and targets of many nursing terminologies. We ext...


Term Extraction and Mining of Term Relations from Unrestricted Texts in the Financial Domain

In this paper, we present an unsupervised hybrid text-mining approach to automatic acquisition of domain-relevant terms and their relations. We deploy the TF-IDF-based term classification method to acquire domain-relevant terms. Further, we apply two strategies in order to learn lexico-syntactic patterns which indicate paradigmatic and domain-relevant syntagmatic relations between the extracted ter...
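The TF-IDF-based term ranking mentioned above can be sketched as follows. This is a minimal illustration with toy documents; the paper's actual corpus, tokenization, and classification thresholds are not shown.

```python
import math
from collections import Counter

# Toy "financial domain" documents (illustrative only).
docs = [
    "interest rate risk and interest rate swaps",
    "credit risk models for loan portfolios",
    "interest rate models and credit spreads",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(term, doc):
    """TF-IDF: raw term frequency times log inverse document frequency."""
    tf = doc.count(term)
    idf = math.log(n_docs / df[term])
    return tf * idf

# Score every term of the first document; terms concentrated in few
# documents (here "swaps") rank higher than broadly shared ones.
scores = {t: tfidf(t, tokenized[0]) for t in set(tokenized[0])}
```

A term occurring in every document gets an IDF of log(1) = 0, so corpus-wide words are pushed to the bottom of the ranking regardless of their raw frequency.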


Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS

Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and WordNet, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...
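Fuzzy-scored extractive summarization of this kind can be sketched as follows. This is a minimal, assumed illustration: the paper's actual sentence features, fuzzy rule base, and WordNet usage are not reproduced; a single triangular membership over mean keyword frequency stands in for the full inference step.

```python
from collections import Counter

def triangular(x, a, b, c):
    """Triangular fuzzy membership over [a, c] with peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Toy document (illustrative only).
document = [
    "Automatic summarization selects the most relevant sentences.",
    "The web grows explosively.",
    "Relevant sentences share many words with the whole document.",
]

# Keyword frequencies over the document: a crude stand-in for the
# richer features a real fuzzy summarizer would compute.
words = [w.strip(".").lower() for s in document for w in s.split()]
freq = Counter(words)

def score(sentence):
    """Fuzzy relevance: membership of the sentence's mean keyword frequency."""
    toks = [w.strip(".").lower() for w in sentence.split()]
    mean_freq = sum(freq[t] for t in toks) / len(toks)
    return triangular(mean_freq, 0.5, 2.0, 3.5)

# Extractive summary: keep the highest-scoring sentence.
summary = max(document, key=score)
```

The membership function maps each sentence's raw feature value into [0, 1], so heterogeneous features could be combined by fuzzy inference rather than ad-hoc weighting.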



Journal:

Volume   Issue

Pages  -

Publication date: 2010